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Abstract 

' High dimensional statistical problems arise from diverse fields of scientific research 
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likelihood methods have been successfully developed over the last decade to cope with 
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1 Introduction 



High dimensional data analysis has become increasingly frequent and important in diverse 
fields of sciences, engineering, and humanities, ranging from genomics and health sciences 
to economic s, finance and machine learning. It char acterizes many contemporary problems 
in statistics (jHastie. Tibshirani and FriedmanI (|2009l )). For example, in disease classification 
using microarray or proteomics data, tens of thousands of expressions of molecules or ions 
are potential predictors; in genowide association studies between genotypes and phenotypes, 
hundreds of thousands of SNPs are potential covariates for phenotypes such as cholesterol 
levels or heights. When interactions are considered, the dimensionality grows quickly. For ex- 
ample, in portfolio allocation among two thousand stocks, it involves already over two million 
parameters in the covariance matrix; interactions of molecules in the above examples result in 
ultra-high dimensionality. To be more precise, throughout the paper ultra-high dimensional- 
ity refers to the case where the dimensionality grows at a non-polynomial rate as the sample 
size increases, and high dimensionality refers to the general case of growing dimensionality. 
Other examples of high dimensional data include high-resolution images, high-frequency fi- 
nancial d^;taj_e-conimerce data, warehouse data, functional, and longitudinal data, among 
others. iDonohd (120001 ) convincingly demonstrates the need for developments in high dimen- 
sional data analysis, and presents the curses and blessings of dimensionality. iFan and Li 
(1200a) give a comprehensive overview of statistical challenges with high dimensionality in a 
broad range of topics, and in particular, demonstrate that for a host of statistical problems, 
the model parameters can be estimated as well as if the best model is known in advance, 
as long as the dimensionality is not excessively high. The challenges that are not present in 
smaller scale studies have been reshaping statistical thinking, methodological development, 
and theoretical studies. 

Statistical accuracy, model interpretability, and computational complexity are three im- 
portant pillars of any statistical procedures. In conventional studies, the number of observa- 
tions n is much larger than the number of variables or parameters p. In such cases, none of 
the three aspects needs to be sacrificed for the efficiency of others. The traditional methods, 
however, face significant challenges when the dimensionality p is comparable to or larger 
than the sample size n. These challenges include how to design statistical procedures that 
are more efficient in inference; how to derive the asymptotic or nonasymptotic theory; how 
to make the estimated models interpretable; and how to make the statistical procedures 
computationally efficient and robust. 

A notorious difficulty of high dimensional model selection comes from the collinearity 
among the p r edicto rs. The collinearity can easily be spurious in high dimensional geometry 
(jFan and Lvl ()2008l )). which can make us select a wrong model. Figured] shows the maximum 
sample correlation and multiple correlation with a given predictor despite that predictors are 
generated from independent Gaussian random variables. As a result, any variable can be 
well- approximated even by a couple of spurious variables, and can even be replaced by them 
when the dimensionality is much higher than the sample size. If that variable is a signature 
predictor and is replaced by spurious variables, we choose wrong variables to associate the 
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Figure 1: Distributions (left panel) of the maximum absolute sample correlation coefRcient 
max2<j<p |corr(Z'i, Zj)|, and distributions (right panel) of the maximum absolute multiple corre- 
lation coefficient of Z\ with 5 other variables (max|5|=5 |corr(Zi, Z^^Qg)], where 3s is the regression 
coefficient of Z\ regressed on Z5, a subset of variables indexed by S and excluding Zi), computed 
by the stepwise addition algorithm (the actual values are larger than what are presented here), when 
n = 50, p = 1000 (solid curve) and p = 10000 (dashed), based on 1000 simulations. 



covariates with the response and, even worse, the spurious variables can be independent of 
the response at population level, leading to completely wrong scientific conclusions. Indeed, 
when the dimensionality p is large, intuition might not be accura te. This is also exemplified 
by the data piling problems in high dimensional space observed in lHall. Marron and Neeman 
(|2005l ). Collinearity also gives rise to issues of over-fitting and model mis-identification. 

Noise accumulation in high dimensional prediction has long been recognized in statistics 
and computer sciences. Explicit characterization of this is well-known for high dimensional 
regression problems. The qua ntification of t he im pact of dimensionality on classification 
was not well understood until I Fan and FanI (120081). who give a simple expressio n on how 
dimensionality impacts misclassification rates. iHall. Pittelkow and GhoshI (j2008l ) study a 
similar problem for distanced based-classifiers an d observe implicitly the adverse impact of 
dimensionality. As shown in iFan and FanI ([2003), even for the independence classification 
rule described in Section 14.21 classification using all features can be as bad as a random 
guess due to noise accumulation in estimating the population centroids in high dimensional 
feature space. Therefore, variable selection is fundamentally important to high dimensional 
statistical modeling, including regression and classification. 

What makes high dimensional statistical inference possible is the assumption that the 
regression function lies in a low dimensional manifold. In such cases, the p-dimensional 
regression parameters are assumed to be sparse with many components being zero, where 
nonzero components indicate the important variables. With sparsity, variable selection can 
improve the estimation accuracy by effectively identifying the subset of important predictors 
and can enhance the model interpretability with parsimonious representation. It can also 
help reduce the computational cost when sparsity is very high. 

This notion of sparsity is in a narrow sense. It should be understood more widely in 
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transformed or enlarged feature spaces. For instance, some prior know ledge may lea,d us to 
apply some grouping or transformation of the input variables (see, e.g., iFan and Lvl (|2008l )). 
Some transformation of the variables may be appropriate if a significant portion of the 
pairwise correlations are high. In some cases, we may want to enlarge the feature space by 
adding interactions and higher order terms to reduce the bias of the model. Sparsity can also 
be viewed in the context of dimensionality reduction by introducing a sparse representation, 
i.e., by reducing the number of effective parameters in estimation. Exar nples include the 
use of a factor model for high dimensional covariance matrix estimation in iFan. Fan and Lv 

Sparsity arises in many scientific endeavors. In genomic studies, it is generally believed 
that only a fraction of molecules are related to biological outcomes. For example, in disease 
classification, it is commonly believed that only tens of genes are responsible for a disease. 
Selecting tens of genes helps not only statisticians in constructing a more reliable classifi- 
cation rule, but also biologists to understand molecul ar mechanisms. In contrast, p opula r 
but naive methods us e d in microarray data analy si s (IDudoit. Shaffer and Boldrickl (j2003l ): 
Storey and Tibshiranil (j2003l ): iFan and RenI (120061 ): lEfronl (j2007l )) rely on two-sample test s 
to pick important genes, which is truly a marginal correlation raii k ing (|Fan and Lvl (120081 )) 
and can miss important signature genes (IFan. Samworth and Wul (1200911). The main goals 
of high dimensional regression and classification, according to iBickell (120081 ). are 



• to construct as effective a method as possible to predict future observations; 

• to gain insight into the relationship between features and response for scientific pur- 
poses, as well as, hopefully, to construct an improved prediction method. 

The former appears in problems such as text and document classification or portfolio opti- 
mization, whereas the latter appears naturally in many genomic studies and other scientific 
endeavors. 



As pointed out in lFan and Lil (120061 ). it is helpful to differentiate two types of statistical 
endeavors in high dimensional statistical learning: accuracy of estimated model parameters 
and accuracy of the expected loss of the e stima ted model. The la tter property is called 
persistence in iGreenshtein and Ritovl (|2004l ) and iGreenshteinI (j2006l ). and arises frequently 
in machine learning problems such as document classification and computer vision. The 
former appears in many other contexts where we want to identify the significant predictors 
and characterize the precise contribution of each to the response variable. Examples include 
health studies, where the relative importance of identified risk factors needs to be assessed for 
prognosis. Many of the existing results in the literature have been concerned with the study 
of consistency of high dimensional variable selection methods, rather than characterizing 
the asymptotic distributions of the estimated model parameters. However, consistency and 
persistence results are inadequate for understanding uncertainty in parameter estimation. 

High dimensional variable selection encompasses a majority of frontiers where statistics 
advances rapidly today. There has been an evolving literature in the last decade devoted to 
understanding the performance of various variable selection techniques. The main theoretical 
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questions include determining the limits of the dimensionality that such methods can handle 
and how to characterize the optimality of variable selection procedures. The answers to the 
first question for many existing methods were largely unknown until recently. To a large 
extent, the second question still remains open for many procedures. In the Gaussian linear 
regression model, the case of orthonormal design reduces to the problem of Gaussian mean 
estimation, as do the wavelet settings where the design matrices are orthogonal. In such 
cases, the risks of various shrinkage estirnators and their optimality have been extensively 



studied. See, e.g.. lDonoho and Johnstone! (jl994l ) and I Antoniadis and FanI (j2001 



In this article we address the issues of variable selection for high dimensional statisti- 
cal modeling in the unified framework of penalized likelihood estimation. It has been widely 
used in statistical inferences and machine learning, and is basically a moderate scale learning 
technique. We also give an overview on the techniques for ultrahigh dimensional screening. 
Combine d iteratively with large scale scr eening, it can handle problems of ultra-high dimen- 
sionality ([Fan. Samworth and Wul (j2009l )). This will be reviewed as well. 

The rest of the article is organized as follows. In Section [21 we discuss the connections 
of penalized likelihood to classical model selection methods. Section [3] details the methods 
and implementation of penalized likelihood estimation. We review some recent advances in 
ultra-high dimensional variable selection in Section HI In Section \5\ we survey the sampling 
properties of penalized least squares. Section [6] presents the classical oracle property of 
penalized least squares and penalized likelihood methods in ultra-high dimensional space. 
We conclude the article with some additional remarks in Section [71 



2 Classical model selection 



Suppose that the available data are (x^, yOiLi' where is the i-th observation of the response 
variable and Xj is its associated p-dimensional covariates vector. They are usually assumed 
to be a random sample from the population (X^, Y), where the conditional mean of Y given 
X depends on the linear predictor /3"^X with /3 = • • • , Pp)"^ ■ In sparse modeling, it is 
frequently assumed that most regression coefficients f3j are zero. Variable selection aims to 
identify all important variables whose regression coefficients do not vanish and to provide 
effective estimates of those coefficients. 

More generally, assume that the data are generated from the true density function fg^ 
with parameter vector Oq = {6i, - ■ ■ , Od)"'" ■ Often, we are uncertain about the true density, 
but more certain about a larger family of models Jq^ in which 6q is a (nonvanishing) subvector 
of the p-dimensional parameter vector Oi. The problems of how to estimate the dimension 
of the model and compare models of different dimensions naturally arise in many statistical 
applications, including time series modeling. These are referred to as model selection in the 
literature. 



Akaikd (|l973l . [197J) proposes to choose a model th at mininiizes t he Kullback-Leibler (KL) 



divergence of the fitted model from the true model. [Akaikd (|1973[ ) considers the maximum 



likelihood estimator (MLE) 6 = {9i, • • • ,0p)^ of the parameter vector 6 and shows that, up 
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to an additive constant, the estimated KL divergence can be asymptotically expanded as 



~^n{0) + Adim(0) = -in{e) + A / 0), 

where ^n{'9) is the log-likelihoo d function , dim( 0) denotes the dimension of the model, and 
A = 1. This leads to the AIC. ISchwarta (|l978l ) takes a Bayesian approach with prior dis- 
tributions that have nonzero prior probabilities on some lower d imensional subspa. ces and 
proposes the BIC with A = (logn)/2 for model selection. Recently, Lv and Liu ( 2008 1) gave a 
KL divergence interpretation of Bayesian model selection and derive generalizations of AIC 
and BIC when the model may be misspecified. 

The work of AIC and BIC suggests a unified approach to model selection: choose a 
parameter vector that maximizes the penalized likelihood 



^{0)-\\\e\ 



(1) 



where the Lg-norm of counts the number of non-vanishing components in 6 and A > 
is a regularization parameter. Given \\0\\q = m, the solution to ([T|) is the subset with the 
largest maximum likelihood among all subsets of size m. The model size m is then chosen 
to maximize ([!]) among p best subsets of sizes m (1 < m < p). Clearly, the computation of 
the penalized Lq problem is a combinational problem with NP-complexity. 

When the normal likelihood is used, ([T]) becomes penalized least squares. Many tradi- 
tional methods can be regarded as penalized likelihood methods with different choices of 
A. Let RSSrf be the residu al sum o f squa, res of the best subset with d variables. Then 
Cp = RSS^/s^ + 2d — n in iMallowa (jl973l ) corresponds to A = 1, where is the mean 
squared error of the full model. The adjusted given by 



R 



n 



1 RSSjj 



adj 



d SST 



n 



also amounts to a penalized-Lo problem, where SST is the total sum of squares. Clearly 
maximizing R^^ is equivalent to minimizing log(RSSd/ {n — d)). By RSS^/n ~ cj^ (the error 
variance), we have 

-f^SSrf . , T-,nn / 2 I J I n 2 



nlog 



n — d 

2 



RSSd/cj^ + (i + n(log(j^ - 1). 



This shows that the adjusted R''' method is approximately equivalent to PMLE with A = 1/2. 
Other examples include the genera lized cross-validation (GC V) gi ven by RSS^/fl — d/ n)^. 



cross-validation (CV), and RIC in iFoster and Georgd (|l994l ). See iBickel and Lil (120061 ) for 
more discussions of regularization in statistics. 



3 Penalized likelihood 



As demonstrated above, Lq regularization arises naturally in many classical model selection 
methods. It gives a nice interpretation of best subset selection and admits nice sampling 
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properties ([Barron. Birge and MassartI (|l999l )l. However, the computation is infeasible in 
high dimensional statistical endeavors. Other penalty functions should be used. This results 
in a generalized form 

p 

n-HM-Y^Pxm), (2) 

i=i 

where iniP) is the log-likelihood function and p\{-) is a penalty function indexed by the 
regularization parameter A > 0. By maximizing the penalized likelihood ([2]), we hope to 
simultaneously select variables and estimate their associated regression coefficients. In other 
words, those variables whose regression coefficients are estimated as zero are automatically 
deleted. 

A natural gener alization of penalized Ln-re gression is penalized Lg-regression, called 
bridge regression in iFrank and FriedmanI (jl993l ). in which = -^l^l"^ for < q < 2. 

This bridges the best subset section (penalized Lq) and ridge regression (penalize d L2), in- 
cludin g the Li-penalty as a specific case. The non- negative garrote is introduced in iBreiman 
(jl995l ) for shrinkage estimati on and variable selection. Penalized Li-regression is called the 
LASSO by iTibshiranil (jl996l ) in the ordinary regression setting, and is now collectively re- 
ferred to as penalized Li-likelihood. Clearly, penalized -Lg-regression possesses the variable 
selection feature, whereas penalized L2-regression does not. What kind of penalty functions 
are good for niodel se lection? 



Fan and Lil (120011 ) advocate penalty functions that give estimators with three properties: 



1) Sparsity: The resulting estimator automatically sets small estimated coefficients to 
zero to accomplish variable selection and reduce model complexity. 

2) Unbiasedness: The resulting estimator is nearly unbiased, especially when the true 
coefficient I3j is large, to reduce model bias. 

3) Continuity: The r esulting estima tor is continuous in the data to reduce instability in 
model prediction (jBreimanl (|l996l )). 



They require the penalty function pa(|^|) to be nondecreasing in |^|, and provide insights 
into these properties. We first consider the penalized least squares in a canonical form. 



3.1 Canonical regression model 

Consider the linear regression model 



X/3 



(3) 



where X = (xi,-- - ,x„)'^, y = (yi,--- ,yn)'^, and s: is an n-dimensional noise vector. If 
e N{0,a'^ln), then the penalized likelihood ([2]) is equivalent, up to an affine transformation 
of the log-likelihood, to the penalized least squares (PLS) problem 



f 1 ^1 

min J ||y-X/3||2 + ^p;,(|/5,|)l 



(4) 
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where || • || denotes the L2-norm. Of course, the penahzed least squares continues to be 
apphcable even when the noise does not follow a normal distribution. 

For the canonical linear model in which the design matrix multiplied by n~^/^ is or- 
thonormal (i.e., X"^X = nip), ^ reduces to the minimization of 

i.\\y-xpf + 0-(3f+j2p>^i\m (5) 

where /3 = n~^X"^y is the ordinary least squares estimate. Minimizing ([5]) becomes a 
componentwise regression problem. This leads to considering the univariate PLS problem 

e{z) = Siigndni^iz - 9f +px{\e\)\ . (6) 



Antoniadis and FanI ((20011) show that the PLS estimator 9{z) possesses the properties: 



1) sparsity if mmt>o{t + p'x{t)} > 0; 

2) approximate unbiasedness if p'xit) = for large t; 

3) continuity if and only if argmin(>o{t + p'xit)} = 0) 

where px{t) is nondecreasing and continuously differentiable on [0, cxd), the function —t—p'^{t) 
is strictly unimodal on (0, cxd), and p'xii) means p'x{0+) when t = for notational simplicity. 
In general for the penalty function, the singularity at the origin (i.e., p'^{0+) > 0) is needed for 
generating sparsity in variable selection and the concavity is needed to reduce the estimation 
bias. 



3.2 Penalty function 



It is known that the convex Lq penalty with g > 1 does not satisfy the sparsity condition, 
whereas the convex Li penalty does not satisfy the unbiasedness condition, and the concave 
Lq penalty with < g < 1 does not satisfy the continuity condition. In other word s , none 
of t he Lq penalt i es sat isfies all three conditions simultaneously. For this reason, I FanI (|l997l ) 
and iFan and Lil (|200ll ) introduce the smoothly clipped absolute deviation (SCAD), whose 
derivative is given by 



p'x{t) = x\l{t<X) + 



(flA - 1)_ 
(a - 1)A 



I {t > X) > for some a > 2, 



(7) 



where pa(0) = and, often, a = 3.7 is used (suggested by a Bayesian argument). It satisfies 
the aforementione d three prop erties. A penalty of similar spirit is the minimax concave 
penalty (MCP) in IZhang (j2009l ). whose derivative is given by 



p'xit) = {aX-t),/a. 



(8) 
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Figure 2: Some commonly used penalty functions (left panel) and their derivatives (right panel). 
They correspond to the risk functions shown in the right panel of Figure [31 More precisely, A = 2 for 
hard thresholding penalty, A = 1.04 for Li-penalty, A — 1.02 for SCAD with a — 3.7, and A — 1.49 
for MCP with a = 2. 



Clearly SCAD takes off at the origin as the L\ penalty and then gradually levels off, and 
MCP translates the flat part of the derivative of SCAD to the origin. When 



PA(t) = A2-(A-t)5 



(9) 



Antoniadia (|l99d ) shows that the solution is the hard-thresholding estimator 9h(z ) = zl(\z\ > 



A). A family of concave penalties that bridge the Lq and Li penalties is studied bv lLv and Fan 
( 20091 ) for model selection and sparse recovery. A linear combination of Li and L2 penalties 
is called an elastic net by IZou and Hastid (j2005l ). which encourages some grouping effects. 
Figure [2] depicts some of those commonly used penalty functions. 

We now look at the PLS estimator 6(z) in ([6]) for a few penalties. Ea ch increasing penalty 
functi on gives a shrinkage rule: \9{z)\ < \z\ and 6{z) = sgn(2;)|6'(2:)| (lAntoniadis and Fan 
( 200ll )). The entropy penalty (Ln penalty) and the hard thresholding penalty yield the 
hard thresholding rule (IDonoho and Johnstond (119941)). while the Li penalty gives the soft 
thresholding rule JBickelT i 19831 ): bonoho and Johnstonel (|l994l )). The SCA D and MCP give 
rise to analytical solutions to ([6]), each of which is a linear spline in z (iFanI (jl997l )). 

How do those thresholded-shrinkage estimators perform? To compare them, we compute 
their risks in the fundamental model in which Z ^ N{9, 1). Let R{6) = E{9(Z) — 6)"^. 
Figure [3] shows the risk functions R{9) for some commonly used penalty functions. To make 
them comparable, we chose A = 1 and 2 for the hard thresholding penalty, and for other 
penalty functions the values of A were chosen to make their risks at = 3 the same. Clearly 
the penalized likelihood estimators improve the ordinary least squares estimator Z in the 
region where 9 is near zero, and have the same risk as the ordinary least squares estimator 
when 9 is far away from zero (e.g., 4 standard deviations away), except the LASSO estimator. 
When 9 is large, the LASSO estimator has a bias approximately of size A, and this causes 
higher risk as shown in Figure [3l When Ahard = 2, the LASSO estimator has higher risk 
than the SCAD estimator, except in a small region. The bias of the LASSO estimator makes 
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Figure 3: The risk functions for penalized least squares under the Gaussian model for the hard- 
thresholding penalty, Li-penalty, SCAD (a = 3.7), and MCP {a — 2). The left panel corresponds to 
A = 1 and the right panel corresponds to A = 2 for the hard-thresholding estimator, and the rest of 
parameters are chosen so that their risks are the same at the point 6 = 3. 



LASSO prefer a smaller A. For Ahard = the advantage of the LASSO estimator around 
zero is more pronounced. As a result in model selection, when A is automatically selected 
by a data-driven rule to compensate the bias problem, the LASSO estimator has to choose a 
smaller A in order to have a desired mean squared error. Yet, a smaller value of A results in 
a complex model. This explains why the LASSO estimator tends to have many false positive 
variables in the selected model. 



3.3 Computation and implementation 



It is challenging to solve th e penalized likeliho od problem ([2|) when the penalty function px 
is nonconvex. Nevertheless, iFan and Lvl ([20091) are able to give t he conditions under w hich 
the penalized likelihood estimator exists and is unique; see also iKim and KwonI (j2009l ) for 
the results of penalized least squares with SCAD penalty. When the Li-penalty is used, 
the objective function ([2]) is concave and hence convex optimization algorithms can be ap- 
plied. We show in this section that the penalized likelihood ([2]) can be solved by a sequence 
of rew eighted penalized Li-regression problems via local linear approximation (jZou and Li 

hood )). 



In the absence of other available algorithms at that time, iFan and Lil ([20011) propose a 
unified and effective local quadratic approximation (LQA) algorithm for optimizing noncon- 
cave penalized likelihood. Their idea is to locally approximate the objective function by a 



quadratic function. Specifically, for a given initial value (3* = {(31, 
function p\ can be locally approximated by a quadratic function as 

ipaC_/3;i), 

2 



] for Pj « P* 



the penalty 



(10) 



With this and a LQA to the log-likelihood, the penalized likelihood ([2]) becomes a least 
squares problem that admits a closed-form solution. To avoid numerical instability, it sets 
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Figure 4: The local linear (dashed) and local quadratic (dotted) approximations to the SCAD 
function (solid) with A = 2 and a — 3.7 at a given point \9\ = 4. 



the estimated coefficient I3j = if /3j is very close to 0, which amounts to deleting the j-th 
covariate from the final model. Clearly the value is an absorbing state of LQA in the sense 
that once a coefficient is set to zero, it remains zero in su bsequent iterati o ns. 

The convergence property of the LQA was studied in iHunter and Lil (120051). who show 
that L QA plays the same role as the E-step in the EM algorithm in lDempster. Laird and Rubin 
(jl977l ). Therefore LQA has similar behavior to EM. Although the EM requires a full itera- 
tion for maximization after each E-step, the LQA updates the quadratic approximation at 
each step during the course of iteration, which speeds up the convergence of the algorithm. 
The converg ence ra t e of L QA is quadratic, which is the same as that of the modified EM 
algorithm in iLangd (|l995l ). 

A better approximation can be achieved by using the local linear approximation (LLA): 



pxm) « pxm)+Pxm)m - for /?, « f3*.. 



(11) 



as m iZou and Lil (|2008l ). See Figure[l]for an illustration of the local linear and local quadratic 
approximations to the SCAD function. Clearly, both LLA and LQA are convex majorants 
of concave penalty function p\{-) on [0,oo), but LLA is a better approximation since it is 
the minimum (tightest) convex majorant of the concave function on [0, oo). With LLA, the 
penalized likelihood ([2]) becomes 

p 



(12) 



where the weights are Wj = p'x{\(i*j\)- Problem (jl2p is a concave optimization problem if 
the log-likelihood function is concave. Different penalty functions give different weighting 
schemes, and LASSO gives a constant weighting scheme. In this sense, the nonconcave 
penalized likelihood is an iteratively reweighted penalized Li regression. The weight function 
is chosen adaptively to reduce the biases due to penalization. For example, for SCAD and 
MCP, when the estimate of a particular component is large so that it has high confidence to 
be non- vanishing, the component does not receive any penalty in ()12p . as desired. 
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Zoul (|2006l ) proposes the weighting scheme wj = for some 7 > and calls the 

resulting procedure adaptive LASSO. This weight reduces the penalty when the previous 
estimate is large. However, the penalty at zero is infinite. When the procedure is applied 
iteratively, zero becomes an absorbing state. On the other hand, the penalty functions such 
as SCAD and MCP do not have this undesired property. For example, if the initial estimate 
is z ero, then Wj = A and th e resul ting e stimate is the LAS SO estimate. 



Fan and Lil (j200l|) , IZoul (j2006l ) , and IZou and Lil (120081 ) all suggest a consistent estimate 



such as the un-penalized MLE. This implicitly assumes that p <^ n. F or dimensionality 
that is larger than sample size n, the above method is not applicable. iFan and Lvl ( 200 
recommend using /3* = 0, which is equivalent to using the LASSO estimate as the initial 
estimate. Another possible initial value is to use a stepwise addition fit or componentwise 
regression. They pu t forward the reco mmendation that only a few iterations are needed, 
which is in line with lZou and Lil ([2008]). 

Before we close this section, we remark that with the LLA and LQA, the resulting 
sequence of target values is always nondecreasing, wh i ch is a specific feature of minorization- 
maximization (MM) algorithms (IHunter and Langd (I2OOOI )). Let p\{(3) = Yl'j=iP>^(.\Pj\)- 
Suppose that at the A:-th iteration, px{(3) is approximated by q\{(3) such that 



PxiP) < qxi(3) and pxiP^"^) = QxiO^''^), 



(13) 



where P^'^^ is the estimate at the k-th. iteration. Let (3^''^^^ maximize the approximated 
penalized likelihood n^^^„(/3) — qx{f3). Then we have 



n 



PxiP 



> 
> 



n 



-qxifS^'^'^] 

qxiP^'^) 

Px{P^% 



Thus, the target values are non-decreasing. Clearly, the LLA and LQA are two specific cases 
of the MM algorithms, satisfying condition (I13p : see Figure SI Therefore, the sequence of 
target function values is non-decreasing and thus converges provided it is bou nded. The 
critical point is the global maximizer under the conditions in lFan and Lvl (j2009l ). 



3.4 LARS and other algorithms 

As demonstrated in the previous section, the penalized least squares problem ^ with an 



Osborne, Presnell and Turlach 


3m. 


Efron et al. ( 


20041 propose 



a fast and efficient least angle regression (LARS) algorithm for variable selection, a simple 
modification of which produces the entire LASSO solution path {/9(A) : A > 0} that optimizes 
dl]). The computation is based o n the fact that the LASSO solution path is piecewise linear 



in A. See lRosset and Zhul (120071 ) for a more general account of the conditions under which 
the solution to the penalized likelihood ([2]) is piecewise linear. The LARS algorithm starts 
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from a large value of A which selects only one covariate that has the greatest correlation 
with the response variable and decreases the A value until the second variable is selected, 
at which the selected variables have the same c orrelation (in mag nitude) with the current 
working residual as the first one, and so on. See lEfron et alj (| 20041 ) for details. 

The idea of the LARS a lgorithm can be expanded to compute the solution paths of 



penalized least squares Q. I Zhang] (j2009l ) introduces the PLUS algorithm for efficiently 
computing a solution path of ([4]) wh en the penalt y function px{-) is a quadratic spline such 



as the SCAD and MCP. In addition, IZhangj (120091 ) also shows that the solution path /9(A) is 
piecewise linear in A, and the proposed solution pat h has desired statistical properties. 



For the penalized least squar es problem (jH) , |Fu| (jl998l ) , iDaubechies. Defrise and De Mol 



(|200J), and IWu and Land (j2008l ) propose a coordinate descent algorithm, which iteratively 
optimizes (jH) one component at a time. This algorithm can als o be applied to optimize the 



group LASSO (jAntoniadis and FanI (j200ll ): lYuan and LinI (120061)) as shown inlMeier. van de Geer and Biihlman. 
(|2008l ). penalized precisi on matrix est i matio n (IFriedman. Hastie and Tibshiranil (j2007l )). and 
penalized likehhood (|2D dPan and"!^ (I2OO9I ): Izhang and hi |2009l )). 

More specifically, Fan and Lv ( 20091 ) employ a path- following coordinate optimization 
algorithm, called the iterative coordinate ascent (ICA) algorithm, for maximizing the non- 
concave penalized likelihood. It successively maximizes the penalized likeliho od (|2l) for reg- 
ularization parameters A in decreasing order. A similar idea is also studied in lzhang and Lil 
(|2009l ). who introduce the ICM algorithm. The coordinate optimization algorithm uses 
the Gauss-Seidel method, i.e., maximizing one coordinate at a time with successive dis- 
placements. Specifically, for each coordinate within each iteration, it uses the second order 
approximation of ini(3) at the p- vector from the previous step along that coordinate and 
maximizes the univariate penalized quadratic approximation 



ma^<^--{z-Or-Apx{\e\) 



(14) 



where A > 0. It updates each coordinate if the maximizer of the corresponding univari- 
ate penalized quadratic approximation makes the penalized likelihood ([2]) strictly increase. 
Therefore, the ICA algorithm enjoys the ascent property that the resulting sequence of values 
of the penalized likelihood is increasing for a fixed A. Compared to other algorithms, the 
coordinate optimization algorithm is especially appealing for large scale problems with both 
n and p large, thanks to its low computational complexity. It is fast to implement when the 
univariate problem (jl4p admits a closed-form solution. This is the case for many commonly 
used penalty functions such as SCAD and MCP. In practical implementation, we pick a suf- 
ficiently large Amax such that the maximizer of the penalized likelihood ([21) with A = Xmax is 



0, and a decreasing sequence of regularization parameters. The studies in lFan and Lvl (|2009l ) 
show that the coordinate optimization works equally well and efficiently for producing the 
entire solution paths for concave penalties. 

The LLA algorithm for computing penalized likelihood is now available in R at 



|http:/ / cran.r-project.org/web/packag es / SIS / index.html 
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as a function in the SIS package. So is the PLUS algorithm for computing the penahzed 
least squares estimator with SCAD and MC+ penalties. The Matlab codes are also available 
for the ICA algorithm for computing the solution path of the penalized likelihood estimator 
and for computing SIS upon request. 



3.5 Composite quasi-likelihood 

The function £n(0) in dZl) does not have to be the t rue likelihood. It can be a quasi-likelihood 
or a loss function (|Fan. Samworth and Wul (|2009l )). In most statistical applications, it is of 
the form 

n p 

n-i^g(xf/3,y,)-E^^A(|/3,|), (15) 
i=i j=i 

where Q{'k[ I3,yi) is the conditional quasi-likelihood of Yi given Xj. It can also be the loss 
function of using to predict yi. In this case, the penalized quasi-likelihood (|15p is written 
as the minimization of 

n p 

n-iJ^L(xf/3,y,)+E^'A(l/3il), (16) 

i=i j=i 

where L is a loss function. For example, the loss function can be a robust loss: L(x, y) = 
\y — x\. How should we choose a quasi-likelihood to enhance the efficiency of procedure when 
the error distribution possibly deviates from normal? 

To illustrate the idea, consider the linear model As long as the error distribution of 
e is homoscedastic, x^/3 is, up to an additive constant, the conditional r quantile of yi given 
Xj. Therefore, (3 can be estimated by the quantile regression 



xf/3), 



i=l 



where Pt{x) = tx+ -|- (1 — r)x_ (jKoenker and BassettI (jl978l )). iKoenkeij (jl984l ) proposes 
solving the weighted composite quantile regression by using different quantiles to improve 
the efficiency, namely, minimizing with respect to 6i, • • • ^hx and /3, 



K 



^ Wfc ^ /^Tfe {Vi -bk- xf /3), 



(17) 



k=l 



i=l 



where {r^} is a given sequence of quantiles and {wk} is a given sequence of weights. IZou and Yuan 
(|2008l ) propose the penalized composite quantile with equal weights to improve the efficiency 

of the penaliz ed least squares. 

Recently, iBradic. Fan and Wangj (|2009l ) proposed the more general composite quasi- 
likelihood 



K 



'^Wk'^Lk{x.ff3,yi) + ^PA(|/5j 



(18) 



k=l 



i=l 
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They derive the asymptotic normahty of the estimator and choose the weight function to 
optimize the asymptotic variance. In this view, it always performs better than a single quasi- 
likelihood function. In particular, they study in detail the relative efficiency of the composite 
L1-L2 loss and optimal composite quantile loss with the least squares estimator. 

Note that the composite likelihood (jlSp can be regarded as an approximation to the 
log-likelihood function via 

K 

log/(y|x) = log/(y|x^/3) « - ^ u;,Lfc(x^/3, y) 

k=l 

with '}2k=i ^fc = 1- Hence, Wk can also be chosen to minimize (llSp directly. If the convexity 
of the composite likelihood is enforced, we need to impose the additional constraint that all 
weights are non- negative. 



3.6 Choice of penalty parameters 

The choice of penalty parameters is of paramount importance in penalized likelihood esti- 
mation. When A = 0, all variables are selected and the model is even unidentifiable when 
p > n. When A = 00, if the penalty satisfies limA^oo Pa(|^|) = 00 for / 0, then none of the 
variables is selected. The interesting cases lie between these two extreme choices. 

The above discussion clearly indicates that A governs the complexity of the selected 
model. A large value of A tends to choose a simple model, whereas a small value of A inclines 
to a complex model. The estimation using a larger value of A tends to have smaller variance, 
whereas the estimation using a smaller value of A inclines to smaller modeling biases. The 
trade-off between the biases and variances yields an optimal choice of A. This is frequently 
done by using a multi-fold cross-validation. 



There are relatively few studies on the choice of penalty parameters. In lWang. Li and Tsai 



(|2007l ). it is shown that the model selected by generalized cross-validation using the SCAD 
penalty contains all important variables, but with nonzero probability includes some unim- 
portant variables, and that the model selected by using BIC achieves the model selection 
consistency and an oracle property. It is worth to point out that missing some true pre- 
dictor causes model misspecification, as does misspecifyi ng the fami l y of d istributions. A 



semi-Bayesian information criterion (SIC) is proposed bv lLv and Liul (|2008l ) to address this 
issue for model selection. 



4 Ultra-high dimensional variable selection 

Variable selection in ultra-high dimensional feature space has become increasingly important 
in statistics, and calls for new or extended statistical methodologies and theory. For exam- 
ple, in disease classification using microarray gene expression data, the number of arrays is 
usually on the order of tens while the number of gene expression profiles is on the order of 
tens of thousands; in the study of protein-protein interactions, the number of features can 
be on the order of millions while the sample size n can be on the order of thousands (see. 
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Figure 5: Illustration of ultra-high dimensional variable selection scheme. A large scale screening 
is first used to screen out unimportant variables and then a moderate-scale searching is applied to 
further select important variables. At both steps, one can choose a favorite method. 



e.g., iTibshirani et al.l (|2003l ) and iFan and RenI (|2006l )): the same order of magnitude occurs 
in genetic association studies between genotypes and phenotypes. In such problems, it is im- 
portant to identify significant features (e.g., SNPs) contributing to the response and reliably 
predict certain clinical prognosis (e.g., survival time and cholesterol level). As mentioned in 
the introduction, three important issues arise in such high dimensional statistical endeav- 
ors: computational cost, statistical accuracy, and model interpretability. Existing variable 
selection techniques can become computationally intensive in ultra-high dimensions. 

A natural idea is to reduce the dimensionality p from a large or huge scale (say, logp = 
0(n") for some o > 0) to a relatively large scale d (e.g., 0{n'') for some 6 > 0) by a fast, 
reliable, and efficient method, so that well-developed variable selection techniques can be 
applied to the reduced feature space. This provides a powerful tool for variable selection 
in ultra-high dimensional feature space. It addresses the aforementioned three issues when 
the variable screening procedures are capable of retaining all the im portant variables w ith 
asymptotic probability one, the sure screening property introduced in iFan and Lvl (j2008l ). 

The above discussion suggests already a two-scale method for ultra-high dimensional vari- 
able selection problems: a crude lar ge scale screening fo llowed by a moderate scale selection. 
The idea is explicitly suggested by iFan and Lvl (j2008l ) and is illustrated by the schematic 
diagram in Figure [H One can choose one of many popular screening techniques, as long as 
it possesses the sure screening property. In the same vein, one can also select a preferred 
tool for moderate scale selection. The large-scale screening and moderate-scale selection can 
be ite ratively applied, resulting in iterative sure indep endence screening (ISIS) (IFan and Lv 
(|2008l )). Its amelioration and extensions are given in iFan. Samworth and wJ ^200^ . who 



also develop R and Matl a b cod es to facilitate the implementation in generalized linear models 
(|McCullagh and Nelderl (jlOsd ^l. 



4.1 Sure independence screening 

Independence screening refers to ranking features according to marginal utility, namely, 
each feature is used independently as a predictor to decide its us efulness for p redict ing 
the response. Sure independence screening (SIS) was introduced by iFan and Lvl (j2008l ) to 
reduce the computation in ultra-high dimensional varia ble selection : all im portant features 
are in the selected model with probability tending to 1 (|Fan and Lvl (|2008l )). An example of 
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independence learning is the correlation ranking proposed in lFan and Lvl (j2008l ) that ranks 
features according to the magnitude of its sample correlation with the response variable. 
More precisely, let u = (wi,-- - ,a;p)^ = X"^y be a p- vector obtained by componentwise 
regression, where we assume that each column of the n x p design matrix X has been 
standardized with mean zero and variance one. For any given take the selected submodel 
to be 

■Md = {1 < i < P : \^j\ is among the first dn largest of all}. (19) 

This reduces the full model of size n to a submodel with size dn, which can be less than 
n. Such correlation learning screens those variables that have weak marginal correlations 
with the response. For classification problems with Y = ±1, the correlation ranking reduces 
to selecting features by using two-sample t-test statistics. See Section 14.21 for additional 
details. 

Other examples of independence learning include methods in microarray data analysis 
where a two-sample test is used to sel e ct sig n ificant genes between the trea t ment and coii t rol 
groups jPudoit. Shaffer and Boldricj j^Ooi) ; Istorev and Tibshiranil (l20oi'l:lFan and RenI ^20od ) 



feature ranking using a generalized correlation 



pa rametric learning under sp a rse ad ditive models (IRavikumar et al.l ([20091)), and the method 



Hall and Milleil (|2009al ')1. non 



Huang. Horowitz and Mai (j2008l ) that uses the margi nal bridge estimators for se l ecting 



m 

variables in high dimensional sparse regression models. iHall. Titterington and Xud ([20091) 
derive some independence learning rules using tilting methods and empirical likelihood, and 
propose a bootstrap method to as sess the fidelity of feature rankin g. In particular, the false 
discovery rate (FDR) proposed bv lBenjamini and Hochbergl (jl995l ) is po pularly used i n mu l- 
tiple testing for c o ntroll ii ig the expected f alse p ositiv e rate. See also lEfroii et al.l ([20011 ) . 
Abramovich et al.l (|2006l ^ . lOonoho and JinI ^20od ). and lciarke and Halll 

We now discuss the sure screening property of correlation screening. Let Ai^ = {1 < j < 
p ■ (3j ^ 0} be the true underlying sparse model with nonsparsity size s = \M.*\; the other 
p — s variables can also be corre l ated w ith the response variable via the link to the predictors 
in the true model. iFan and Lvl (120081 ) consider the case p ^ n with logp = 0{n°') for some 
a G (0, 1 — 2k), where k is specified below, and Gaussian noise e ~ A^(0, cj^) for some a > 0. 
They assume that var(y) = 0(1), Amax(S;) = O(n^), 

min \f3j\ > cn~'^ and min \cov{P~^Y, Xj)\ > c, 



where 5] = cov(x), k, r > 0, c is a positive constant, and the p-dimensional covariate 
vector x has an elliptical distribution with the random matrix XS~^/^ having a concentra- 
tion property that holds for Gaussian distributions. For studies on the e xtreme eigenvalue s 
and limiting spect r al di s tribut i ons of large rand om m atrices, s ee, e . g.. ISilversteinI (| 19851 ). 



Bai and YinI (Il993l ^. Bail (|l999l l. I Johnstone (l200ll'l. andlLedouxl (l200ll . l2005l ^. 



Under the above regularity conditions. 



Fan and Lvl (|2008l ) show that if 2k + r < 1, then 
there exists some 6 G (2k + r, 1) such that when dn ~ n^, we have for some C > 0, 



PiM^ C Md) = 1 - 0{pe 



VI 



(20) 
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In particular, this sure screening property entails the sparsity of the model: s < dn- It 
demonstrates that SIS can reduce exponentially high dimensionality to a relatively large 
scale dn <C n, while the reduced model still contains all the important variables with 
an overwhelming probability. In practice, to be conservative we can choose d = n — 1 or 
[n/logn]. Of course, one can also take final model size d>n. Clearly larger d means larger 
probability of including the true underlying sparse model in the final model Md- See 
Section 4.3 for further results on sure independence screening. 

When the dimensionality is reduced from a large scale p to a moderate scale d by applying 
a sure screening method such as correlation learning, the well-developed variable selection 
techniques, such as penalized least squares methods, can be applied to the reduced feature 
space. This is a powerful tool of SIS based variable selection methods. The sampling proper- 
ties of these methods can be easily obtained by combining the theory of SIS and penalization 
methods. 



4.2 Feature selection for classification 



Independence learning has also been widely used for feature selection in high dimensional 
classification problems. In this section we look at the specific setting of classification and 
continue the topic of independence learning for variable selection in Section 14. 3i Consider 
the p-dimensional classification between two classes. For k G {1,2}, let X^i, Xfc2, • • • , ^kru, 
be i.i.d. p-dimensional observations from the A:-th class. Classification aims at finding a 
discriminant function (5(x) that classifies new observations as accurately as possible. The 
classifier J(-) assigns x to the class 1 if 5(x) > and class 2 otherwise. 

Many classification methods have been proposed in the literature. The best classifier is 
the Fisher 5f{'^ = (x — yLt)'5]~^(//]^ — when the data are from the normal distribution 
with a common covariance matrix: X^j ~ A^(//^,5]), for k = 1,2 and n = {fii + /X2)/2. 
However, this method is hard to implement when dimensionality is high due to the difficulty 
of estimating the unknown covariance matrix Hence, the independence rule that involves 
estimating the diagonal entries of the covariance matrix, with discriminant function 5(x) = 
(x — /i)'D~^(/i]^ — /X2) is frequently e mployed for the classifica tion, where D = diagjS}. For 
a survey of recent developments, see iFan. Fan and Wul (|201Cll ). 

Classical methods br eak down when the dimensionality is high. As demonstrated by 
Bickel and Levinal (120041 ) . the Fisher discrimination method no longer performs well in high 
dimensional settings due to the diverging spectra and singularity of the sample covariance 
matrix. They show that the independence rule overcomes these problems and outperforms 
the Fisher discriminant in high dimensional setting. However, in practical implementation 
such as tumor classification using microarray data, one hopes to find tens of genes that have 
high discriminative power. The independence rule does not possess the property of feature 
selection. 

The noise accumulation phenomenon is well-kn own in the regressio n setup, but has never 
been quantified in the classification problem until iFan and FanI (120081 ). They show that the 
difficulty of high dimensional classification is intrinsically caused by the existence of many 
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noise features that do not contribute to the reduction of classification error. For example, 
in linear discriminant analysis one needs to estimate the class mean vectors and covariance 
matrix. Although each parameter can be estimated accurately, aggregated estimation error 
can be very large and can significantly increase the misclassification rate. 

Let Ro be the common correlation matrix, Amax(Ro) be its largest eigenvalue, and a = 
A*i ~ ^2- Consider the parameter space 



r = {{a, S) : cx'T>^cx > Cp, A 



;H-o) < bo, min a- > 0}, 

i<i<p 



where Cp and 60 are given constants, and (t| is the j-th diagonal element of S. Note 
that Cp measures the strength of signals. Let 5 be the estimated discriminant function 
of the independence rule, obtained by plugg ing in the sample estimates of a and D. If 
■\/ nin2 / {np)Cp Dq > O. lFan and FanI ( 20081 ) demonstrate that the worst case classification 
error, W{6), over the parameter space F converges: 



W{5) 



1 _ $ 



Dn 



2Vbo 



(21) 



where n = ni + 712 and $(•) is the cumulative distribution function of the standard normal 
random variable. 

The misclassification rate (j2ip relates to dimensionality in the term Dq, which depends 
on Cp/y/p. This quantifies the tradeoff between dimensionality p and the overall signal 
strength Cp. The signal Cp always increases with dimensionality. If the useful features are 
located at the first s components, say, then the signals stop increasing when more than 
s features are used, yet the penalty of using all features is Clearly, using s features 
can perform much better than using all p features. The optimal number should be the one 
that minimizes Cm/\^, where the Cm are the signals of the best subset 5* of m features, 
defined as cxsT)g^cxs, where 0:5 and are the sub-vector and sub-matrix of a and D 
constructed using variables in S. The result (j2ip also indicates that the independence rule 
works no better than random guessing due to noise accumulation, unless the sig n al lev els are 



extremely high, say, \JnJpCp > B for some i? > 0. iHall. Pittelkow and GhoshI ( 20081 ) show 
that if Cp/p — > 00, t he classification erro r goes to zero for a distance-based classifier, which 
is a specific result of lFan and FanI (j2008l ) with B = 00. 

The above results reveal that dimensionality reduction is also very important for reducing 
misclassification rate. A popular class of dimen sionality redu ctio n techniques is projection. 



See, fo r example, principal com ponent analysis inlGhoshI (120021) andlZou. Hastie and Tibshirani 



(|200J); partial lea st squares in 



verse regres sion m 



Huang and FanI (120031 ) ; and 



Boulestei> 



xl kooi): 



and sliced in- 



Chiaromonte and Martinelli ( 20021 ). Antoniadis. Lambert-Lacroix and Leblanc 



(|2003l ). and iBura and Pfeiffeii (j2003l ). These projection methods attempt to find directions 
that can result in small classification errors. In fact, the directions that they find usually 
put much larger weights on features with large classifica tion power, which is indeed a type 
of sparsity in the projection vector. iFan and FanI (j2008l ) formally show that linear projec- 
tion methods are likely to perform poorly unless the projection vector is sparse, namely, the 
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effective number of selected features is small. This is due to the aforementioned noise accu- 
mulation whe n estimating an d hi gh dimensional p r oblein s . For formal results, se e 
Theorem 2 inlPan and FanI (I2OO8I 1 . See alsolxibshirani et all (I2OO2I) [Ponoho and Jh\ (20081. 
Hall. Park and SamworthI (l2008ll. iHall. Pittelkow and GhoshI (|2008l l . iHall and ChanI ^20od ). 
Hall and Milled (|2009bl ) , and Ijini (j2009l ) for some recent developments in high dimensional 
classifications. 

To select irnportan t features, the two-sample t test is frequently employed (see, e.g., 
Tibshirani et al.l (|2003l )). The two-sample t statistic for feature j is 



X 



X 



2j 



(22) 



where X^j and S'^j are the sample mean and variance of the j-th feature in class k. This 
is a specific examp le of independence learning, which ranks the features according to \Tj\. 



Fan and FanI (120081 ) prove that when dimensionality p grows no faster than the exponential 
rate of the sample size, if the lowest signal level is not too small, the two-sample t test can 
select all important features with probability tendi ng to 1. Their pr o of relies on the devia- 
tion r esults of the two-s ample t-statistic. See, e.g., iHalll (Il987l . 120061 ). iJing. Shao and Wang 
(j2003l ) , and ICao I (|2007l ) for large deviation theory. 

Although the t test can correctly select all important features with probability tending 
to 1 under some regularity conditions, the resulting choice is not necessarily optimal, since 
the noise accumulation can exceed the signal accumulation for faint features. Therefore, 
it is necessary to f urther single out the most important features. To address this issue. 
Fan and FanI (120081 ) propose the Features Annealed Independence Rule (FAIR). Instead of 
constructing the independence rule using all features, FAIR selects the most important ones 
and uses them to construct an independence rule. To appreciate the idea of FAIR, first note 
that the relative importance of features can be measured by \aj\/aj, where aj is the j-th 
component oi a = fii — and (t| is the common variance of the j-th feature. If such 
oracle ranking information is available, then one can construct the independence rule using 
m features with the largest \aj\/aj, with optimal value of m to be determined. In this case, 
the oracle classifier takes the form 



•^(x) = ^aj{xj - /ij)/<T|l||„^.|/^^,>fe}, 



where 6 is a positive constant. It is easy to see that choosing the optimal m is equivalent to 
selecting the optimal b. However oracle information is usually unavailable, and one needs to 
learn it from the data. Observe that \aj\/(Tj can be estimated by \cej\/aj, where the latter is in 
fact ^^n/{nln2)\Tj\, in which the pooled sample variance is used. This is indeed the same as 
ranking the featur e by using the corr elation between the jth vari able with the class response 
ifcl wh en Til = n2 (jFan and Lvl (j2008l )). Indeed, as pointed out bv lHall. Titterington and Xue 
(1200a), this is always true if the response for the first class is assigned as 1, whereas the 
response for the second class is assigned as —ni/n2- Thus to mimick the oracle, FAIR takes 
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a slightly different form to adapt to the unknown signal strength 



W(x) = aj{x, - Ai)/<^i l{^;^7(;^|T,|>b}- (23) 
i=i 

It is clear from ()23p that FAIR works the same way as if we first sort the features by the 
absolute values of their t-statistics in descending order, then take out the first m features to 
construct a classifier. The number of features is selected by minimizing the upper bound of 
the classification error: 



n\ 



1 ^[TJjLi Tu) + m{ni - na)/ 

m = arg max -— ; — , 

i<m<p A^^^ mnina + nina ^JL^ Tf.^ 

where T"^^-^ > T^|^ > • • • > T'^^-^ are the ordered squared t-statistics, and A^ax is the estimate 
of the largest eigen value of the correlation matrix Rq^ of the m most significant features. 



Fan and FanI (|2008l ) also derive the misclassification rates of FAIR and demonstrate that it 



possesses an oracle property. 

4.3 Sure independence screening for generalized linear models 

Correlation learning cannot be directly applied to the case of discrete covariates such as 
ge netic studies with different genotypes. The mathematical results and technical arguments 



m lFan and Lvl (120081 ) rely heavily on the joint normality assumptions. The natural question 
is how to screen variables in a more general context, and whether the sure screening property 
continues to hold with a limited false positive rate. 

Consider the generalized linear model (GLIM) with canonical link. That is, the condi- 
tional density is given by 

/(y|x) = exp {yd{-K) - 6(0(x)) + c{y)] , (24) 

for some known functions 6(-), c(-), and 0(x) = x"^/3. As we consider only variable selection 
on the mean regression function, we assume without loss of generality that the dispersion 
parameter = 1. As before, we assume that each variable has been standardized with mean 
and variance 1. 

For GLIM ([M]), the penahzed likehhood ^ is 

n p 

-n-i^£(xf/3,y,)-E^A(|/?il), (25) 
i=i j=i 

- M 

where i{9,y) = h{6) — yO. The maximum marginal likelihood estimator (MMLE) (3j is 
defined as the minimizer of the componentwise regression 

n 

/flf = 0%, Pf) = argmin^^,^^, ^ + PjXij , Yi), (26) 

i=l 
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where Xij is the ith observation of the jth variable. This can be easily computed and its 
implementation is robust, avoiding numerical instability in ultra-high dimensional problems. 
The marginal estimator estimates the w rong object of cours e, but its magnitude provides 
useful information for variable screening. iFan and SongI (|2009l ) select a set of variables whose 
marginal magnitude exceeds a predefined threshold value jn- 



M 



{l<i<p:|/3f|>7n} 



7n 



(27) 



This is equivalent to ranking features according to the magnitude of MMLEs To 
understand the utility of MMLE, we take the population version of the minimizer of the 
componentwise regression to be 



/3f 



(/3j!^,/5ff 



argmin^g « .B£(/?o + (3jXj, Y). 



Fan and SongI (|2009l ) show that = if and only if cov(Xj,y) = 0, and under some 
additional conditions if |cov(Xj, Y)\ > cin~'^ for j £ Mi,, for given positive constants ci and 
K, then there exists a constant C2 such that 



min 1/3, 



Ml 



>C2n 



(28) 



In words, as long as Xj and Y are somewhat marginally correlated with n < 1/2, the marginal 
signal is detectable. They prove further the sure screening property: 



P Mi^cM 



In 



(29) 



(the convergence is exponentially fast) if 7^ = C3n~'^ with a sufficiently small C3, and that 
only the size of non-sparse elements (not the dimensionality) matters for the purpose of 
sure screening property. For the Gaussian linear model ([3]) with sub-Gaussian covariate 
tails, the diinensio nality can be as high as logp = o(n(^^^'*)/^), a weaker result than that in 
Fan and Lvl (|2008l ) in terms of condition on p, but a stronger result in terms of the conditions 
on the covariates. For logistic regression with bounded covariates, such as genotypes, the 
dimensionality can be as high as logp = o(n^~^''). 

The sure screening property (|29|) is only part of the story. For example, if 7^ = then 
all variables are selected and hence (f29]l holds. The question is how large the size of the 
selected mode l size in ([27|) with 7^ = c^n''^ should be. Under some regularity conditions, 
Fan and SongI (j2009l ) show that with probability tending to one exponentially fast. 



l-M7nl =0{n2''A^ax(S)}. 



(30) 



In words, the size of selected model depends on how large the thresholding parameter 7^ is, 
and how correlated the features are. It is of order Ofn^"^"'"'^) if Ainax(5]) = 0{n'^). This is the 



same or somewhat stronger result than in lFan and Lvl (120081 ) in terms of selected model size, 
but holds for a much more general class of models. In particularly, there is no restrictions 
on K and r, or more generally Aniax(^)- 
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Fan and Sona (|2009l ) also study feature screening by using the marginal likelihood ratio 
test. Let Lq = Tainf^^^n"^ Yl^=i ^{Po,Yi) and 

n 



Li = L 







min/3(,,/3j.n 



i=l 



'0 



+ Pj^ij,Yi)- 



(31) 



Rank the features according to the marginal utility {Lj}. Thus, select a set of variables 

={l<j<Pn- Lj > Un}, (32) 

where Un is a predefined threshold value. Let L* be the population counterpart of Lj. 
Then, the minimum signal minjg_A4, is of order 0{n~'^'^), whereas the individual noise 
Op(n~^/^). In words, when k > 1/4, the noise level is larger than the signal. 



Li 



This is the key technical c hallenge. By using the fact that the ranking is invariant to 



monotonic transformations, iFan and Songl (|2009l ) are able to show that with Vn 
for a sufficiently small C4 > 0, 

P{M, C \Aj < 0(n2-A^ax(5]))} ^ 1. 

Thus the sure screening property holds with a limited size of the selected model. 



-2k 



4.4 Reduction of false positive rate 

A screening method is usually a crude approach that results in many false positive variables. 
A simple idea of reducing th e false positive rate is to apply a resampling technique as proposed 
by iFan. Samworth and Wul (|2009l ) . Split the samples randomly into two halves and let ^1 
and A2 be the selected sets of active variables based on, respectively, the first half and the 
second half of the sample. If ^1 and A2 both have a sure screening property, so does the 
set A. On the other hand, A = Aif\ A2 has many fewer falsely selected variables, as an 
unimportant variable has to be selected twice at random in the ultra-high dimensional space, 
which is very unlikely. Therefore, A reduces the number of false positive variables. 

Write A for the set of active indices - that is, the set containing those indices j for which 
fij / in the true model. Let d be the size of the select ed sets Ai and A2. Under some 
exchangeability conditions, iFan. Samworth and Wul (|2009l ) demonstrate that 

P(l^'n^1>0<^<i(^)^ (33) 

where, for the second inequality, we require that d < n < {p — |^|)^/^. In other words, the 
probability of selecting at least r inactive variables is very small when n is small compared 
to p, such as for the situations discussed in the previous two sections. 

4.5 Iterative sure independence screening 

SIS uses only the marginal information of the c ovariates and its su re screening property can 
fail when technical conditions are not satisfied. iFan and Lvl (120081 ) point out three potential 
problems with SIS: 



23 



a) (False Negative) An important predictor that is marginally uncorrelated but jointly 
correlated with the response cannot be picked by SIS. An example of this has the 
covariate vector x jointly normal with equi-correlation p, while Y depends on the 
covariates through 

x^^'^ = Xi + --- + Xj- JpXj+i. 

Clearly, Xj^i is independent of Ji.'^fS* and hence Y, yet the regression coefficient —Jp 
can be much larger than for other variables. Such a hidden signature variable cannot 
be picked by using independence learning, but it has a dominant predictive power on 
Y. 

b) (False Positive) Unimportant predictors that are highly correlated with the important 
predictors can have higher priority to be selected by SIS than important predictors 
that are relatively weakly related to the response. An illustrative example has 

Y = pXo + Xi + ---+Xj + e, 

where Xq is independent of the other variables which have a common correlation p. 
Then corr(Xj, Y) = Jp = J corr(Xo, Y), for j = J + 1, • • • and Xq has the lowest 
priority to be selected. 

c) The issue of collinearity among the predictors adds difficulty to the problem of variable 
selection. 

Translating a) to microarray data analysis, a two-sample test can never pick up a hidden 
signature gene. Yet, missing the hidden signature gene can result in very poor under standing 



of the molecular mechanism and in poor disease classification. iFan and Lvl (|2008l ) address 
these issues by proposing an iterative SIS (ISIS) that extends SIS and uses more fully the 
joi nt information of the cov a riates . ISIS still maintains computational expediency. 



Fan. Samworth and Wul (|2009l ) extend and improve the idea of ISIS from the multiple 
regression model to the more general loss function (I16p : this includes, in addition to the 
log-likelihood, the hinge loss L{x, y) = (1 — xy)+ and exponential loss L( x, y) = exp(— x?/) i n 



classification in which y takes values ±1, among others. The ^-learning (|Shen et al.l (|2003l )) 
can also be cast in this framework. ISIS also allows variable deletion in the process of 
iteration. More generally, suppose that our objective is to find a sparse f3 to minimize 



n p 

n-ij;L(y„xf/3) + ^PA(|/?il 
i=i j=i 



The algorithm goes as follows. 

1. Apply an SIS such as (j32p to pick a set Ai of indices of size ki, and then employ a 
penalized (pseudo)-likelihood method ([15]) to select a subset TWi of these indices. 
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2. (Large-scale screening) Instead of computing residuals as in iFan and Lvl (|2008l ). com- 
pute 

n 



(34) 



1=1 



for j ^ Ail, where Xj__A4j is the sub- vector of Xj consisting of those elements in A^i. 
This measures the additional contribution of variable Xj in the presence of variables 

(2) 

^Mi- Pick k2 variables with the smallest {Lj , j Aii} and let A2 be the resulting 
set. 

3. (Moderate-scale selection) Use penalized likelihood to obtain 



^2 = argmin^jj^^^^^^^^n ^'^L{Yi,(3o + + ^l^^pj^^) + ^ P\{\l3j\)- 

i=l j&MiVJA2 

(35) 

This gives new active indices M2 consisting of n qnvanishing demeri ts of /32- This step 
also deviates importantly from the approach in iFan and Lvl (|2008l ) even in the least 
squares case. It allows the procedure to delete variables from the previous selected 
variables M.i. 

4. (Iteration) Iterate the above two steps until d (a prescribed number) variables are 
recruited or = A^^_i. 



The final estimate is then 13 j^^. In implementation, I Fan. Samworth and Wul (|2009l ) choose 
ki = [2d/3j, and thereafter at the r-th iteration, take kr = d — |A^r-i|- This ensures that 
the iterated versions of SIS take at least two iterations to terminate. The above method 



can be considered as an analogue of the le ast squares ISIS pr oced ure (IFan and Lvl (120081')) 
witho ut explicit definition of the residuals. iFan and Lvl (j2008l ) and iFan. Samworth and Wu 
(|2009l ) show empirically that the ISIS significantly improves the performance of SIS even in 
the difficult cases described above. 



5 Sampling properties of penalized least squares 

The sampling properties of penalized likelihood estimation ([2|) have been extensively stud- 
ied, and a significant amount of work has been contributed to penalized least squares (Hj). 
The theoretical studies can be mainly classified into four groups: persistence, consistency 
and selection consistency, the weak oracle property, and the oracle property (from weak to 
strong). Again, persistence means consistency of the risk (expected loss) of the estimated 
model, as opposed to consistency of the estimate of the parameter vector under some loss. 
Selection consistency means consistency of the selected model. By the weak oracle property, 
we mean that the estimator enjoys the same sparsity as the oracle estimator with asymp- 
totic probability one, and has consistency. The oracle property is stronger than the weak 
oracle property in that, in addition to the sparsity in the same sense and consistency, the 
estimator attains an information bound mimicking that of the oracle estimator. Results have 
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revealed the behavior of different penalty functions and the impact of dimensionality on high 
dimensional variable selection. 



5.1 Dantzig selector and its asymptotic equivalence to LASSO 



The Li regularization (e.g., LASSO) has received much attention due to its convexity 
and encouraging sparsity solutions. The idea of using the L-\ norm can be traced bac k 
to the introduction of convex relaxation for deconvolution in IClaerbout and Muiii (|l973l ). 



Taylor. Banks and McCoyI (jl979l ). and lSantosa and Symed (jl986l ). The use of the Li penalty 
has been shown to have close connections to oth e r met hods. For example, sparse approxi- 
mation us ing an Lj app roach is shown in iGirosil (|l998l ) to be equivalent to support vector 
machines (jVapnikI (|l995l )) for noiseless data. Another example is the asymptotic equivalence 
between the Dantzig selector (jCandes and Tad (|2007l )) and LASSO. 

The -Li regularizat ion has also been used in the Dantzig selector recently proposed by 
Candes and Tad (j2007l ). which is defined as the solution to 



min subject to ||n~^X'^(y - X/3)||oo < A, 



(36) 



where A > is a regularization parameter. It was named after Dantzig because the convex 
optimization problem (j36p can easily be recast as a linear program. Unlike the PLS dH) which 
uses the residual sum of squares as a measure of goodness of fit, the Dantzig selector uses the 
Loo norm of the covariance vector n~^X^(y — X/3), i.e., the maximum absolute covariance 
between a covariate and the residual vector y — X/3, for controlling the model fitting. This 
Loo constraint can be viewed as a relaxation of the normal equation 



X^X/3, 



(37) 



namely, finding the estimator that has the smallest Li-norm in the neighborhood of the 
least squares estimate. A prominent feature of the Dantzig selector is its nonasymptotic 
oracle inequalities under L2 loss. Consider the Gaussian linear regression model ([3]) with 
e ~ N{0, a'^In) for some o" > 0, and assume that each covariate is standardized to have L2 
norm ^/n (note t hat we changed the sca le of X since it was assumed that each covariate has 



unit L2 norm in lCandes and Tad (j2007l )). Under the uniform uncertainty principle (UUP) 



on the design matrix X, a condition on the finite condition number for submatrices of X, 
they show that, with high probability, the Dantzig selector (3 mimics the risk of the oracle 
estimator up to a logarithmic factor logp, specifically 



11/3 



(38) 



where /3o = (/So.ir'' ^Po,p)^ is the vector of the true regression coefficients, C is some 
positive constant, and A ~ -y/(21ogp)/n. Roughly speaking, the UUP condition (see also 



Donoho and StarkI (119891 ): iDonoho and Hud (|200ll )) requires that all n x d submatrices of 
X with d comparable to ||/3ollo are uniformly close to ortho normal mat r ices, which can 
be stringent in high dimensions. See I Fan and Lvl (|2008l ) and ICai and Lvl (|2007l ) for more 
discussions. The oracle inequality ([38l) does not infer much about the sparsity of the estimate. 
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Shortly after the work on the Dantzi g selector, it was observed that the Dantzig selector 
and the LASSO share some similarities. iBickel. Ritov and Tsvbakovl (|2008l ) present a theo- 
retical comparison of the LASSO and the Dantzig selector in t he general high dimensional 
nonpa r ametric regression model. Under a sparsity scenario, IBickel. Ritov and Tsvbakov 
(|2008l ) derive parallel oracle inequalities for the prediction risk for both methods, and es- 
tablish the asymptotic equivalence of the LASSO estimator and the Dantzig selector. More 
specifically, consider the nonparametric regression model 



(39) 



where f = (/(xi), • • • , /(x„))"^ with / an unknown p-variate function, and y, X = (xi, • • • , x^)"^, 
and £ are the same as in ([3l). Let , /m) b e a fin ite dictionary of p-variate func- 



tions. As pointed out in IBickel. Ritov and Tsvbakovl (|2008l ). fj's can be a collection of basis 
functions for approximating /, or estimators arising from M different methods. For any 
P = (/?!,... ,pMf, define fp = J2f=iPjfj- Then similarly to dD and the LASSO 

estimator fi and Dantzig selector fi) can be defined accordingly as and with 

and /3£) the corresponding M- vectors of minimizers. In both formations, the empirical norm 

\\Mn = 



n ^X]"=i/J(xj) of fj is incorporated as its scale. IBickel. Ritov and Tsvbakov 
(|2008l ) show that under the restricted eigenvalue condition on the Gram matrix and some 
other regularity conditions, with significant probability, the difference between ||//) — /||,^ 
and Wfi^ — /ll^ is bounded by a product of three factors. The first factor sa^ jn corresponds 
to the prediction error rate in regression with s parameters, and the other two factors in- 
cluding log M reflect the impact of a large number of regressors. They further prove sparsity 
oracle inequalities for the prediction loss of both estimators. These inequalities entail that 
the distance between the prediction losses of the Dantzig selector and the LASSO estimator 
is of the same order as the distances b etween them and their oracle approximations. 



Bickel. Ritov and Tsvbakovl (j2008l ) also consider the specific case of a linear model ([3]), 
say (f39l) with true regression function f = X/3q. If e ~ A^(0, cr^/„) and some regularity 
conditions hold, they show that, with large probability, the Lg estimation loss for 1 < (7 < 2 
of the Dantzig selector ^j-, is simultaneously given by 



11/3 



/^ollo 



/ I \2{g-i) 



log^y/^ 



n 



(40) 



where s = ||/3o||o) "t- ^ s is associated with the strong restricted eigenvalue condition on the 
design matrix X, and C is some positive constant. When q = 1, they prove (HOl) under a 
(weak) restricted eigenvalue condition that does not involve m. IBickel. Ritov and Tsvbakov 
(|2008l ) also derive similar inequalities to (|40p with slightly different constants on the Lq 
estimation loss, for 1 < (7 < 2, of the LASSO estimator These results demonstrate the 
approximate equivalence of the Dantzig selector and th e LASSO. The similarity between th e 
Dantzig select or and LASSO has also been discussed in lEfron. Hastie and Tibshiranil (j2007l ). 
Lounicil ([200a) derives the Lqo convergence rate and studies a sign concentration property 
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simultaneously for the LASSO estimator and the Dantzig selector under a mutual coherence 
condition. 

Note that the covariance vector n~^X"^(y — X/3) in the formulation of Dantzig selector 



()36p is exactly the negative gradient of (2n)~^||y — X/3|p in PLS This in fact entails that 
the Dantzig selector and the LASSO estimator are identical under some suitable conditions, 
pro vided that the same regularization parameter A is used in both methods. For exam- 
ple, iMeinshausen. Rocha and Yul (120071 ) give a diagonal dominance condition of the p x p 
matrix (X'^X)~^ that ensures th eir equivalence. This condition implicitly assumes p < n. 
James. Radchenko and Lvl (|2009l ) present a formal necessary and sufficient condition, as well 
as easily verifiable sufficient conditions ensuring the identical solution of the Dantzig selector 
and the LASSO estimator when the dimensionality p can exceed sample size n. 



5.2 Model selection consistency of LASSO 

There is a huge literature devoted to studying the statistical properties of LASSO and re- 
lated methods. This Li method as well as its variants ha ve also been extensively stud - 



ied in such other areas as compressed sensing. For example, iGreenshtein and Ritovl (|2004l ) 



show that under some regularity conditions the LASSO-type p rocedures are persi stent un 



der quadratic loss for dimensionality of pol ynomial growth, and iGreenshteinI (|2006l ) extends 



the results to more general loss functions. iMeinshausenI (j2007l ) presents similar results for 



the LASSO for dimensionality of exponential growth and finite nonsparsity size, but its 



consistency results see 


Donoho. Elad anc 


Temlvakov ( 


2006 




Meinshausen and Biihlmann 


(200fill 


WainwriffhtJ (20od^. 


Zhao and Yu 


(2006). 


Bunea. Tsvbakov and WegkamcJ ( 


200?!), 


Bickel. Ritov and Tsvbakov ( 


20081). 


van de Geei 


(20081). andlzhang and Huang ( 


2008) 


amon 



others. 

As mentioned in the previous section, consistency results for the LASSO hold under 
some conditions on the design matrix. For the purpose of variable selection, we are also 
concerned with the sparsity of the estimator, particularly its model selection consistency 
meaning that the estimator /3 has the same support as the true regression coefficients vector 
/3q with asymptotic probability one. IZhao and Yul (|2006l ) characterize the model selection 
consistency of the LASSO by studying a stronger but technically more convenient property 
of sign consistency: P(sgn(/3) = sgn(/3o)) — > 1 as n — > oo. They show that the weak 
irrepresentable condition 

||Xl^Xi(XfXi)-isgn(/3i)|U < 1 (41) 

is necessary for sign consistency of the LASSO, and the strong irrepresentable condition, 
which requires that the left-hand side of (j41|) be uniformly bounded by a positive constant 
C < 1, is sufficient for sign consistency of the LASSO, where (3i is the subvector of Pq on its 
support supp(/3o), and Xi and X2 denote the submatrices of the nx p design ma trix X formed 
by columns in supp(/3o) and its complement, respectively. See also IZoul (120061 ) for the fixed 
p case. However, the irrepresentable condition can become restrictive in high dimensions. 
See Section 15.51 for a simple illustrative example, because the same condition shows up in 
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a related problem of sparse recovery by using Li regularization. This demonstrates that in 
high dimensions, the LASSO estimator can easily select an inconsistent model, which explains 
why the LASSO tends to include many false positive variables in the selected model. 

To establish the weak oracle property of the LASSO, in addition to the sparsity charac- 
terized above, we need its consistency. To this end, we usually need the condition on the 
design matrix that 



Xi(Xf Xi) 



< c 



(42) 



for some positive constant C < 1, which is stronger than the strong irrepresentable condition. 
It says that the Li-norm of the regression coefficients of each inactive variable regressed on 
s active variables must be uniformly bounded by C < 1. This shows that the capacity of the 
LASSO for selecting a consistent model is very limited, no ticing also that the Li -norm of the 
regression coefficients typically increase with s. See, e.g.. IWainwrightl (|2006l ). As discussed 
above, condition (|42p is a stringent condition in high dimensions for the LASSO estimator 
to enjoy the weak oracle property. The model s election consistency of the LASSO in the 
context of graphical models has been studied bv iMeinshausen and Biihlmannl (|2006l ). who 
consider Gaussian graphical models with polynomially growing numbers of nodes. 



5.3 Oracle property 

What are the sampling properties of penalized least squares dH) and penalized likelihood 
estimation ^ when the penalty function p\ is no longer convex? The oracle property 



(|Fan and Lil (|200ll )) provides a nice conceptual framework for understanding the statistical 
properties of high dimen sional variable sele ction methods. 

In a seminal paper, iFan and Lil (120011 ) build the theoretical foundation of nonconvex 
penalized least squares or, more generally, nonconcave penalized likelihood for variable se- 
lection. They introduce the oracle property for model selection. An estimator /3 is said to 
have the oracle property if it enjoys sparsity in the sense that = with probability tending 
to 1 as n — > 00, and attains an information bound mimicking that of the oracle estimator, 
where (Bi and are the subvectors of /3 formed by components in supp(/3q) and supp(/3o)'^, 
respectively, while the oracle knows the true model supp(/3o) beforehand. The oracle proper- 
ties of penalized least squares es timators can be un derstood in the more general framework of 
penalized likelihood estimation. iFan and Lil (j200ll ) study the oracle pr operties of nonconcave 
penalized likelihood estimators in the finite-dimensional setting, and iFan and PengI ([200J) 
extend their results to the moderate dimensional setting with p = oip}^^) or o(n^/^). 

More specifically, without loss of generality, assume that the true regression coefficients 
vector is /3q = (/9f,/3f')^ with (3^ and the subvectors of nonsparse and sparse elements 



respectively : ||/3 



/3nlln and 0') = 0. Let a. 



o and = ||y{(|/3i|)||oo. 
Fan and Lil (|200ll ) and lFan and Pend (|2004l ) show that, as long as a„, 5„ = o(l), under some 



regularity conditions there exists a local maximizer (3 to the penalized likelihood ^ such 
that 



||/3-/3ol|2 = Op(v^C 



n 



-1/2 



+ an)). 



(43) 
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This entails that choosing the regularization parameter A with a„ = 0(n^^/^) gives a root- 
{n/p) consistent penahzed hkehhood estimator. In particular, this is the case when the 
SCAD pena l ty is used if A = o(mini<j<s |/?oj|)! where f3i = (/So.ir'" )/3o,s)^- Recently, 
Fail and Lvl (120091) ga ve a sufficient cond i tion u nder which the solution is unique. 



Fan and Lil (j200lh and lFan and Penei (|2004l ) further prove the oracle properties of penal- 
ized likelihood estimators under some additional regularity conditions. Let S = diag{p'j[ (1/3^1)} 
and p\{f3i) = sgn(/3i) °p'\i\l3i\), where o denotes the the Hadamard (componentwise) prod- 
uct. Assume that A = o(mini<j<s |/3o,il)i ^/ra/pA ^ oo as n — > cxd, and the penalty function 
p\ satisfies liminf„_^oo liniinft^o+PA(^)/-^ ^ ^- They show that \i p = o(n^/^), then with 



probability tending to 1 as n - 
satisfies the following 

a) (Sparsity) 32 = 0; 

b) (Asymptotic normality) 

-1/2 



OO, the root-(n/p) consistent local maximizer /3 = {f3i ,(32 



V^A„I-^/'(Ii + S)[/3i -(3, + (Ii + i:)-'pxi(3,)] ^ iV(o, G), 



(44) 



where A„ is a qx s matrix such that A„A^ G, a q x q symmetric positive definite matrix, 
Ii = I(/9i) is the Fisher information matrix knowing the true model supp(/3o), and Pi is a 
subvector of j3 formed by components in supp(/3o)- 

Consider a few penalties. For the SCAD penalty, the condition A = o(min|/3^|) entails 
that both px{(3i) and S vanish asymptotically. Therefore, the asymptotic normality (I44p 
becomes 



V^Aniy^Pi - Pi) ^ N{0,G 



9 



(45) 



which shows that P^ has the same asymptotic efficiency as the MLE of Pi knowing the true 
model in advance. This demonstrates that the resulting penalized likelihood estimator is as 
efficient as the oracle one. For the Li penalty (LASSO), the root-(n/p) consistency of /3 
requires A = a„ = OirT^I'^), whereas the oracle property requires \J n/p\ — > oo as n — > oo. 
However, these two conditions are incompatible, which suggests that the LASSO estimator 
generally does not have the oracle property. This is intrinsically due to the fact that the Li 
penalty does not satisfy the unbi asedness co ndition. 

It has indeed been shown in IZoul (j2006l ) that the LASSO estimator does not have the 
oracle property even in the finite parameter setting. To address the bias issue of LASSO, he 
proposes the adaptive LASSO by using an adaptively weighted Li penalty. More specifically, 
the weight vector is \P\~'^ for some 7 > with the power understood componentwise, where /3 
is an initial root-n consistent estimator of /3q. Since /3 is root-n consistent, the constructed 
weights can separate important variables from unimportant ones. This is an attempt to 
introduce the SCAD-like penalty to reduce the biases. From ()12p . it can easily be seen that 
the adaptive LASSO is just a specific solution to penalized least squares using LLA. As a 
consequence, IZoul (j2006l ) shows that the adaptive LASSO has the oracle property under some 



regularity conditions. See also IZhang and Huand (|2008l ) 
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5.4 Additional properties of SCAD estimator 

In addition to the oracle prop erties outlined i n the last section and also in Section 6.2, 
Kim. Choi and Qhl (120081 ) and iKim and KwonI (120091 ) provide insights into the SCAD es- 
timator. They attempt to answer the question of when the oracle estimator ^9 is a local 
minimizer of the penalized least squares with the SCAD penalty, when the SCAD estima- 
tor and the oracle estimator coincide, and how to check whether a local minimizer is a 
global minimizer. The first two results are indeed stronger than the oracle property as they 
show that the SCAD estimator is the oracle estimator itself rather than just mimicking its 
performance. 

Recall that all covariates have been standardized. The follow assumption is needed. 

Condition A: The nonsparsity size is s„ = O(n^i) for some < ci < 1, the minimum 
eignvalue of the correlation matrix of those active variables is bounded away from zero, 
and the minimum signal mini<j<s„ > c^n^^^^'^'^^^'^ for some constant C2 G (ci, 1]. 



Under Condition A, iKim. Choi and Qhl (120081 ) prove that if Es^ < oo for the linear model 

P{f3 is a local minima of PLS with the SCAD penalty) — > 1, (46) 

provided that A„ = o(n-ti-('=2-ci)}/2^ ^ o{{^/n\nf''} . This shows that the SCAD 

method produces the oracle estimator. When k is sufficiently large, the dimensionality 
Pn can be of any polynomial order. For the Gaussian error, the result holds even with 
NP-dimensionality. More precisely, for the Gaussian errors, they show that (|46|) holds for 
Pn = 0(exp(c4n)) and A„ = 0(n~^^~'^5^/^), where < C4 < C5 < C2 — ci. The question 
then arises naturally whether th e global minimizer of pena lized least squares with the SCAD 
penalty is the oracle estimator. iKim. Choi and Qhl (j2008l ) give an affirmative answer: with 
probability tending to one, the global minimizer of penalized least squares with the SCAD 
penalty is the same as the oracle estimator when the correlation matrix of all covariates is 
bo unded away frorn zero and infinity (necessarily, Pn ^ n). 



Kim and KwonI (|2009l ) also give a simple condition under which the SCAD e s timat or is 



unique and is a global minimizer (see also the simple conditions in I Fan and Lvl (120091 ) for 
a more general problem). They also provide sufficient conditions to check whether a local 
minimizer is a global minimizer. They show that the SCAD method produces the oracle 
estimator, 

P{The SCAD estimator = f3"} ^ 1, 

under conditions similar to Condition A, even when the minimum eigenvalue of the correla- 
tion matrix of all variables converges to zero. 



5.5 Sparse recovery and compressed sensing 

Penalized Li rn ethods have been widely applied in areas including compressed sensing 
(jDonohol (|2006al )). In those applications, we want to find good sparse representations or 
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approximations of signals that can greatly improve efficiency of data storage and transmis- 
sion. We have no intent here to survey results on compressed sensing. Rather, we would 
like to make some innate connections of the problem of sparse recovery in the noiseless case 
to model selection. Unveiling the role of penalty functions in sparse recovery can give a 
simplified view of the role of penalty functions in high dimensional variable selection as the 
noise level approaches zero. In particular, we see that concave penalties are advantageous in 
sparse recover y, which is i n line with the advocation of folded concave penalties for variable 
selection as in lFan and Lil (j200ll ). 

Consider the noiseless case y = X/Jg of the linear model dS]). The problem of sparse 
recovery aims to find the sparsest possible solution 



argmin ||/3||o subject to y = X/3. 



(47) 



The solution to y = X/3 is not uniqu e when the n x p matrix X has rank less than p, e.g. 



when p > n. See iDonoho and EladI (|2003l ) for a characterization of the identifiability of 
the minimum Lq solution Pq. Although by its nature, the Lq penalty is the target penalty 
for sparse recovery, its computational complexity makes it infeasible to implement in high 
dimensions. This motivated the use of penalties that are computationally tractable relax- 
ations or approximations to the Lq penalty. In particular, the convex Li penalty provides 
a nice convex relaxation and has attracted much attention. For properties of various Li 
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More generally, we can replace the Lq penalty in (f^TIl by a penalty function p(- 
consider the p-regularization problem 



mm 



^^/3(|/3j|) subject to y = X/3. 



and 



(48) 



This constrained optimization problem is closely related to the PLS in A great deal 
of research has contributed to identifying conditions on X and (3q that ensure th e Li /Lo 
equiva lence, i.e., the Li-regularization (f48l) gives the same solution (3q. For example. iDonoho 
(j2004l ) contains deep results and shows that t he individual equivalence of L^ / Ln depends onl y 
on supp(/3o) and / 3n on its support. See also iDonoho and Hud (120011 ) and iDonohd (j2006bl ). 
In a recent work, iLv and FanI (120091 ) present a sufficient condition that ensures the p/Lq 
equivalence for concave penalties. They consider increasing and concave penalty functions 
p{-) with finite maximum concavity (curvature). The convex Li penalty falls at the boundary 
of this class of penalty functions. Under these regularity conditions, they show that (3q is a 
local minimizer of (08]) if there exists some e S (0, minj<s |/3oj |) such that 



max ||Xi^Xi(XfXi)-iu||oo < p (0+), 



(49) 



where 
sections. 



{sgn{Pi) o p'(|v|) : ||v — /3i||oo < e}, the notation being that of the previous two 
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When the Li penalty is used, contains a single point sgn(/3i) with sgn understood 
componentwise. In this case, condition (j49p becomes the weak irrepresentable condition (|4ip . 
In fact the Li/Lq equivalence holds provided that (j4ip weakened to nonstrict inequality is 
satisfied. However, this condition can become r estrict ive in high dimensions. To appreciate 
this, look at an example given in iLv and FanI (|2009l ). Suppose that Xi = (xi,--- ,Xs) is 



orthonormal, y = Poj^j with |/3o,i| = • • • = \Po,s\, ^s+i has unit L2 norm and corre- 

lation r with y, and all the rest of the Xj's are orthogonal to {xj}j^^. The above condition 
becomes \r\ < s^^/^. This demonstrates that the Li penalty can fail to recover the sparsest 
solution /3q when the maximum correlation of the noise variable and response is moderately 
high, which, as explained in the Introduction, can easily happen in high dimensions. 

On the other hand, the concavity of the penalty function p entails that its derivative 
p'{t) is deceasing in t G [0, 00). Therefore, condition (P9j) can be (much) less restrictive for 
concave penalties other than Li. This shows the advantage of concave penalti es in sparse 
recove ry, which is consistent with similar understandings in variable selection in lFan and Li 
(I2OO1I ). 



6 Oracle property of penalized likelihood with ultra-high di- 
mensionality 

As shown in Section [H large-scale screening and moderate-scale selection is a good strategy 
for variable selection in ultra- high dimensional feature spaces. A less stringent screening (i.e., 
a larger selected model size in (I24p ) will have a higher probability of retaining all important 
variables. It is important to study the limits of the dim ensionality that nonconcave penalized 



likelihood methods can handle. The existing result of iFan and Peng (|2004l ) is too weak in 
terms of the dimensionality allowed for high dimensional modeling; they deal with too broad 
a class of models. 

What are the roles of the dimensionality p and nonsparsity size s? What is the role 
of penalty functions? Does the oracle property continue to hold in ultra-high dimensional 
feature spaces? These questions have been driving the t h eoreti cal development of high di- 
mensional variable selection. For example, iKoltchinskiil (120081) obtain s orac le inequalities 



for penalized least squares with entropy penalization, and Ivan de Geerl (120081 ) establishes a 
nonasymptotic oracle inequality for the Lasso estimator as the empirical risk minimizer in 
high dimensional generalized linear models. There are relatively few studies on the statis- 



nonconvex penalties. More recent studies on this topic include 


Huane. Horowitz and Ma 


(2008 


)^ 
), 


Kim. Choi and Oh (2008). 


Meier, van de Geer and Biihlmannl ( 


20081). 


Lv and Fan 


(2009 


Zhang 


( 


2009). and Fan and Lv ( 


2009 


), among others. 



6.1 Weak oracle property 

An important step towards the understanding of the oracle property of penalized likelihood 
methods in ultra-high dimensions is the weak oracle property for model selection, introduced 
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by ILv and FanI (|2009l ) in the context of penalized least squares. An estimator j3 is said to 
have the weak oracle property if it is uniformly consistent and enjoys sparsity in the sense 
of = with probability tending to 1, i.e. model selection consistency, where is 
the subvector of j3 formed by components in supp(/3o)'^ and the oracle kn ows the true mode l 
supp(/3o) beforehand. This property is weaker than the oracle property in i Fan and Lil (|200lh . 
Consistency is derived under loss, mainly due to the technical difficulty of proving the 
existence of a solution to the nonlinear equation that characterizes the nonconcave penalized 
likelihood estimator. It is important to study the rate of the probability bound for sparsity 
and the rate of convergence for consistency. The dimensionality p usually enters the former 
rate explicitly, from which we can see the allowable growth rate of p with sample size n. 
Consider the PLS problem dH) with p enalty function p\ . Let p{t\X) = X'^pxjt) and write 



it as p{t) whenever there is no confusion. iLv and FanI (j2009l ) and lFan and Lvl (|2009l ) consider 
the following class of penalty functions: 

• p{t;X) is increasing and concave in t G [0, oo), and has a continuous derivative p'{t;X) 
with p'{0+;X) > 0. In addition, p'{t;X) is increasing in A G (0,oo) and /o'(0+;A) is 
independent of A. 

This is a wide class of concave pe nalties including SCAD and MCP, and the Li penalty at 
its boundary. iLv and FanI (j2009l ) establish a nonasymptotic weak oracle property for the 
PLS estimator. They consider dSj) with e ~ A^(0, cr^In)- The notation here is the same as in 
Section m Assume that each column of the nxp design matrix X (covariate) is standardized 
to have L2 norm ^/n (or of this order), and let d„ = 2~^ min {|/3o,j | : Poj 7^ 0} be the minimal 
signal. The following condition is imposed on the design matrix 



|X^Xi(Xf X] 



<min(c4^,OK^)) 

Pidn) 



(50) 



where ai > 0, C E (0,1), and p is associated with the regularization parameter A 



n 



ai-l/2„ 



Here is a sequence of positive numbers diverging to infinity. Clearly, for 

the Li penalty, condition (1501) becomes (1421) whic h is a somewhat stronger form of the strong 
irrepresentable condition in I Zhao and Yul (j2006l ). Condition (j50p consists of two parts: the 
first part is intrinsic to the penalty function whereas the second part is purely a technical 
condition. For folded-concave penalties other than Li, the intrinsic condition is much more 
relaxed: the intrinsic upper bound is C < 1 for the Li penalty whereas it is 00 when dn ^ A„ 
for the SCAD type of penalty. In other words, the capacity for LASSO to have model se- 
lection consistency is limited, independent of model signals, whereas no limit is imposed for 
SCAD type of penalties when the signals are strong enough. In general, the concavity of />(•) 
guarantees condition (j50p and is more relaxed than the Li penalty. 

Under the above an d some addi t ional regularity conditions, if ||(n~^X^Xi)^^||oo = 
0(77-"") for some oq > 0, iLv and FanI (j2009l ) show that for sufficiently large n, with proba- 
bility at least 1 — -^pu~^e~^'"^'^ , the PLS estimator /3 = {(3i satisfies the following 



a) (Sparsity) (32 = 0; 
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b) (Loo loss) ||/3i - (3 



1 Hoc 



0(n 



a„-l/2 



where is a subvector of /3 formed by components in supp(/3o)- In particular, when the 
signals are so sparse that s is finite, oq = for all non-degenerate problems. In this case, 
by taking = clogp for c > 2 so that the probability 1 — ;^pu~^e~""/^ 1, we have 

11/^1 ~ /^illoo = Op (n^ ""^/^ \/log p) ■ As an easy consequence of the general result, 



||3-/3oll2 = Op(yin 



00-1/2 



(51) 



when p = o(ii„e""/^). The dimensionality p is allowed to grow up to exponentially fast 
with Un- More specifically, ti„ can be allowed as large as o(n^/^~"°~"i(i„) and thus logp = 
o(n^~^*^"''~''"i)(i^). This shows that a weaker minimal signal needs slower growth of dimen- 
sionality for successful variable selection. From their studies, we also see the known fact that 
concave pen alties can reduce th e biases of estimates. 



Recently, iFan and Lvl (120091 ) extended the results of ILv and FanI (120091 ) and established 
a nonasymptotic weak oracle property for non-concave penalized likelihood estimator in 
generalized linear models with ultra-high dimensionality. In their weak oracle property, they 
relax the term n„ from the consistency rate. A similar condition to (jSOp appears, which shows 
the drawback of the Li penalty. The dimensionality p is allowed to grow at a non-polynomial 
(NP) rate. Therefore, penalized likelihood methods can still enjoy the weak oracle property 
in ultra-high dimensional space. 



6.2 Oracle property with NP-dimensionality 

A lon g-standing questio n is whether the penalized likelihood methods enjoy the oracle prop- 
ert y (IFan and Lil (120011 )) in ultra-high dimensions. This issue has recently been addressed 
by iFan and Lvl (j2009l ) in the context of generalized linear models. Such models include the 
commonly used line ar, logistic, a n d Poi sson regression models. 

More specifically iFan and Lvl ([20091) show that, under some regularity conditions, there 
exists a local maximizer /3 = {Pi , )^ of the penalized likelihood ([2]) such that = with 
probability tending to 1 and ||/3 — /3o||2 = Op{\/sn~'^^'^), where f5i is a subvector of /3 formed 
by components in supp(/3o) and s = ||/3o||o- They further establish asymptotic normality 
and thus the oracle property. The conditions are less restrictive for such concave penalties 
as SCAD. In particular, their results suggest that the Li penalized likelihood estimator 
generally cannot achieve the consistent rate of Op(y^n~^/^) and does not have the oracle 
property when the dimens i onalit y jp is diverging with t he sa mple size n . This is consistent 
with results in lPan and IJ (|200lh . IPan and Peii3 ([2OOJ), and[zo3 (|2006l ). 

It is natural to ask when the non-conc ave penalized likeli hood estimator is also a global 



maximizer of the penalized likelihood ([2]). iFan and Lvl (120091 ) give characterizations of such 



a property from two perspectives: global optimality and restricted global optimality. In 
particular, they show that under some regularity conditions, the SCAD penalized likelihood 
estimator can be identical to the oracle estimator. This feature of the SCAD penalty is not 
shared by the Li penalty. 
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7 Concluding remarks 



We now have a better picture of the role of penalty functions and the impact of dimensionahty 
on high dimensional regression and classification. The whole story of high dimensional 
statistical learning is far from complete. New innovative techniques are needed and critical 
analyses of their relative merits are required. Issues include the characterization of optimality 
properties, the selection of data-driven penalty functions and parameters, the confidence 
in selected models and estimated parameters, group variable selection and its properties, 
inference after model selection, the incorporation of information on covariates, nonparametric 
statistical learning, manifold learning, compressed sensing, developments of high dimensional 
statistical techniques in other important statistical contexts, and development of robust and 
user-friendly algorithms and software. High dimensional statistical learning is developed to 
confront and address the challenges in the frontiers of scientific research and technological 
innovation. It interfaces nicely with many scientific disciplines and will undoubtedly further 
advances on emerging societal needs. 
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